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Abstract 

The circadian clock is an important molecular mechanism that enables many organisms to anticipate and adapt to 
environmental change. Pokhiiko et ol. recently built a deterministic ODE mathematical model of the plant circadian 
clock in order to understand the behaviour, mechanisms and properties of the system. The model comprises 
30 molecular species (genes, mRNAs and proteins) and over 100 parameters. The parameters have been fitted 
heuristically to available gene expression time series data and the calibrated model has been shown to reproduce 
the behaviour of the clock components. Ongoing work is extending the clock model to cover downstream effects, 
in particular metabolism, necessitating further parameter estimation and model selection. This work investigates the 
challenges facing a full Bayesian treatment of parameter estimation. Using an efficient adaptive MCMC proposed 
by Haario et ol. and working in a high performance computing setting, we quantify the posterior distribution 
around the proposed parameter values and explore the basin of attraction. We investigate if Bayesian inference is 
feasible in this high dimensional setting and thoroughly assess convergence and mixing with different statistical 
diagnostics, to prevent apparent convergence in some domains masking poor mixing in others. 



Introduction 

The circadian clock is a molecular mechansism that syn- 
chronises biological processes with the day/night cycle 
and is found in many organisms [1]. The presence of a 
clock enables an organism to anticipate and adapt to 
environmental change and hence use energy sources 
more efficiently. The mechanism includes interlocked, 
transcriptional feedback loops. Pokhiiko et aL built a 
mathematical model of the plant circadian clock in 
order to understand the behaviour, mechanisms and 
properties of the system [2]. Recent experiments show 
that three plant-specific proteins ELF3, ELF4 and LUX 
form an evening complex (EC) which binds to the pro- 
moters of target genes [3]. This has led to a revision of 
the model [4]. The latest version of the clock model, 
illustrated in Figure 1, represents interconnected morn- 
ing and evening loops in a three loop structure. The 
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morning loop comprises transcription factors LHY and 
CCA 7, which activate the expression of PRR9 PRR7 and 
PRRS/NL The transcriptional co-regulators PRR9, PRR7 
and PRR5 inhibit LHY and CCAl expression by binding 
to their promoters. The evening loop, previously repre- 
sented by TOC and a hypothetical gene (Y), introduced 
by Locke et al [5] is now represented by TOC, LUX, 
ELF3 and ELF4. 

Parameter values used in these studies were either con- 
strained (based on the available data) or fitted to multiple 
time series data sets (for full details see Supplementary 
Table SI in [4]). Previously, Locke et al constructed a cost 
function to quantif)^ the agreement between an earlier ver- 
sion of the model and various key experimental features 
[5]. They undertook an efficient global search of parameter 
space and showed that this optimized solution fits several 
but not all the experimental features. 

The aim of this work is to set the scene for full Baye- 
sian inference of the model parameters using state-of- 
the-art MCMC techniques. We explore the landscape of 
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Figure 1 Outline of the Arabidopsis circadian clocl^ based on Figure 1 in [4]. The morning and evening loop elements are represented by 
white and grey boxes respectively. The solid lines indicate transcriptional regulation and the short dashed lines indicate post-translational 
regulation of TOCl and EC by Gl, ZTL and COPl. An arrow signifies activation and a block inhibition. The EC protein complex formation is 
denoted by the short-long dashed line. Flashes represent acute light responses and asterisks post-translational regulation by light. 



the posterior distribution around the model solution and 
investigate the challenges in finding the posterior distri- 
bution from increasingly distant initial starting positions 
in parameter space. We consider the implications of our 
findings on the design of efficient parameter estimation 
schemes for components of the circadian clock. 

Methods 

Mathematical model and data 

We apply the methods to synthetic data generated from 
the model ODEs, described by Pokhilko et al, [4] where 
the actual parameter values are known. We use the latest 
version of the model which comprises the evening com- 
plex and a light function mimicking 12 hours day and 12 
hours night. The model comprises 28 species (i.e. mRNAs 
and proteins) and an additional two species are used for 
fitting purposes, in total 30 species. There are 103 
unknown parameters. Typical differential equations for 
the dimensionless concentrations of ELF4 mRNA, c^^, 
suppressed by EC and LUX proteins, Cec ^nd Cmx > 
respectively; and EFL4 protein, C£4, modified by ELF3 
nuclear protein, c^s^, and ELF3-ELF4 nuclear protein 
complex, C£34„, are: 



dd 



£4 _ ^ ^2 ^6 

dt gl + Cec gi + q 



(1) 



dC: 



E4: ffl /<-)\ 

= p23<^£4 — ^35<^£4 " p25<^£4<^£3n + p21<^£34n v^>' 

where riis, ^3435, ^5^21,23,25 and ^2,6 represent the rate 
constants of transcription, degradation, protein translation 



/modification /complex formation /translocation between 
nucleus and cytoplasm and Michaelis-Menton constants, 
respectively. The full set of differential equations can be 
found in the supplementary material in Pokhilko et al. [4]. 

The use of synthetic data in this way enables us to assess 
the accuracy of the inference prediction. White Gaussian 
noise was added to this time series data to obtain a signal 
to noise ratio (SNR) of around 100. Fifty time points, one 
every hour over a two day period was considered realistic 
and suitably rich to capture key features of the data, 
namely the oscillatory period, amplitude and phase. 

Parameter estimation 

We derived posterior distributions for the model para- 
meters under a Bayesian framework using the efficient 
adaptive Markov chain Monte Carlo (MCMC) algorithm 
described by Haario et aL, implemented using their 
MATLAB code [6] . This method uses a multivariate Gaus- 
sian proposal to move the exploratory chains through pos- 
terior spaces which may contain ridges or other 
challenging features - very likely when, as in our case, the 
number of parameters is high and/or parameters are 
correlated. 

The algorithm described above requires a model defined 
sum-of-squares function and assumes additive i.i.d. Gaus- 
sian errors for the observations. Letting y/ (6) be the ODE 
solution for a set of initial conditions and parameters, 6, 
then, in our case, the sum of squares function is 



(3) 
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where ^n(^) is the model output corresponding to the 
nth. data point, yn- This score is equivalent to the nega- 
tive log likelihood under a homoscedastic additive i.i.d. 
Gaussian noise model [7]. For the prior, we assumed an 
improper uniform distribution. This is the worst-case 
scenario corresponding to the complete absence of com- 
plementary biological information, which was chosen 
deliberately so as to obtain a conservative lower bound 
on both the parameter estimation accuracy as well as 
the rate of convergence. 

The MCMC chains should be run until they have satis- 
factorily converged. A suitable number of steps is not 
typically known in advance and hence convergence diag- 
nostics are used to monitor convergence. Our chains 
were run for approximately 10^ iterations. Four sub 
chains were used to establish the potential scale reduc- 
tion factor (PSRF). The PSRF is a ratio reflecting the 
between chain variance and the within chain variance. 
This and other methods are discussed in [8]. If the PSRF 
is large then either our estimate of variance can be 
decreased by more simulations or the within chain var- 
iance will increase since the simulated sequences have 
not yet made a full tour of the target distribution. Gener- 
ally a value of 1.1 is taken to indicate reasonable confi- 
dence that the chains have converged. PSRF is estimated 
in parameter space but we also use it to consider whether 
the sum-of-squares function, equation 3 has converged in 
data space. The multivariate analogue (MPSRF), essen- 
tially an upper bound of PSRF, was used to consider con- 
vergence at the higher order [8]. Posterior measures, 
including mean and variance are based on every 100th 
iteration of the last 10^ iterations, pruned to reduce levels 
of autocorrelation. 

Obtaining the numerical solution of the ODE, y/n{0), is 
an expensive component of the overall computational task 
so we accelerated its calculation using high speed ODE 
simulations available from SBTOOLBOX2 and SBPD [9]. 
On average this made the calculations between 10 and 20 
times faster than those based on MATLAB's built-in ODE 
solvers. Computations were designed to run on a multi- 
nodal cluster using MATLAB's parallelization facilities. 

Results 

Quantification of parameter posteriors starting at true 
parameter values 

In our first test, denoted Experiment 1, we start the 
MCMC exploratory chains at the model's true values 
and observe how far the chains will travel before they 
reach a stationary phase. The Euclidean distance 
between the posterior mean and the true value, for each 
parameter, is indicative of the attractive pull of the 
model solution. The variance of the posterior distribu- 
tion summarizes the inherent uncertainty in the system. 
Noise in the system, arising from measurement errors, 



may mean that the maximum a posteriori does not lie 
directly over the true value. 

After 10^ iterations, the sum-of-squares plots, tracing 
the fit of the model to the data, are indicated to have 
converged (mean PSRF = 1.01), see Table 1. However the 
multivariate statistic (MPSRF = 1.81) indicates that the 
trace plots are still changing with implications for 
the accuracy of the fit of the model between species. 
Concerning parameter space, 99% of parameters show no 
absence of convergence as indicated by PSRF<1.1 (mean 
PSRF = 1.02) and all the posterior estimates lie within 
the 5^^-95^^ percentile of the posterior distribution. The 
Euclidean distance in parameter space between the 
MCMC chain start and the true value (denoted EDq) is 
zero and the Euclidean distance between the posterior 
mean and the true value is 0.25 (denoted ED^). This is 
our first indication of how far it is possible to perturb the 
parameters before they lie outside the basin of attraction. 

Perturbance of the starting position further defines 
attraction of the posterior basin 

Next, we perturbed the starting point of the exploratory 
adaptive MCMC chains by sampling from a Normal dis- 
tribution centred on the true value with variance 
increasing in small steps from variance= 0.005^ (corre- 
sponding to EDq = 0.06) to variance= 0.1^ (EDq = 0.98). 
We denote these experiments as Experiments 2 to 10, 
see Table 1. We then tried choosing initial starting para- 
meters from a Gamma distribution, T (1, 2) and T (2, 4), 
denoted Experiments 11 to 13. For all but two experi- 
ments where variance were less than 0.015^ (Experi- 
ments 1 to 3) we took the initial start from inferring the 
initial condition to be the observed level in our data, see 
Table 1. 

Generally as the initial starting value is perturbed the 
mean PSRF value for the sum-of-squares trace plots 
increases and the number of species converging reduces 
(Columns 3 and 4, Table 1). Except for the first three 
experiments, with no or little perturbation, ED^ (Column 
10, Table 1) is comparable with the EDq (Column 2, 
Table 1), suggesting that the MCMC parameter chains 
are not venturing far from the true values. However by 
Experiment 9 the perturbation is such that over 10% of 
the parameters are converging to posterior means that 
are significantly far away from the true value (i.e. true 
value lies outside the 5^^-95^^ percentile posterior distri- 
bution). This could be indicative of alternative parameter 
regimes giving rise to the observed data and can be tested 
by inspecting the log likelihood. 

Parameter convergence is not correlated to Euclidean 
distance between starting value and true value 

As mentioned above, convergence in data space at the 
species level decreases with distance, most notably as 



Table 1 Summary of convergence diagnostics in parameter and data space 
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EDq crosses ED^, Hence we checked whether conver- 
gence in parameter space was correlated to initial start 
values. We found that neither EDq nor the percentage 
of EDq to the true value is significantly correlated {P > 



0.05) with the PSRF for Experiments 1-7. Examination 
of the marginal posterior distributions for the five para- 
meters with the lowest PSRFs and the five highest in 
Experiment 6 (Figure 2) illustrates that recovery of the 
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Figure 2 Posterior probability density estimates for parameters with low end PSRFs (left column) and parameters with high end 
PSRFs (right column). True parameter values are indicated by a cross on tine x-axis. 
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true parameter values may not be controlled directly by 
convergence diagnostics. 

The reason why some parameters are not converging 
appears parameter specific and possibly related to indivi- 
dual levels of inter-correlation with other parameters or 
species. To test this hypothesis we investigated what hap- 
pens if we remove highly correlated parameters from the 
analysis. We saw a reduction in the MPSRF (e.g. removing 
parameters with a correlation coefficient of 0.25 or above 
would reduce the MPSRF from 3.38 to 1.02 in Experiment 
2) but more work will be required to systematically expose 
patterns of inter-correlations in this complex network of 
30 species and over 100 parameters. 

Conclusions 

The aim of this work is to set the scene for full parameter 
estimation and model comparison in a Bayesian context 
for the circadian clock model. Previously, model para- 
meters have been fitted heuristically [2,4,10]. Those model 
calibration exercises reproduce features of the data but 
cannot rule out other parameter regimes. As the clock 
continues to be extended to other species and to down- 
stream activities such as metabolism it becomes increas- 
ingly important to evaluate competing scenarios, and the 
Bayesian approach extends naturally to model comparison. 
This work represents the first full Bayesian treatment of 
parameter estimation for the circadian clock in Arabidop- 
sis Thaliana and informs future work for tackling this 
complex problem. 

Our initial investigation highlights two main areas. First, 
we have shown that modern MCMC techniques, when 
implemented in a high performance computing environ- 
ment, make it feasible to attempt Bayesian inference in 
this high- dimensional setting. 

Second, apparent convergence in either data or para- 
meter space, using diagnostic techniques, may mask poor 
mixing, both pairwise and at higher orders of the explora- 
tory chains. This issue requires further investigation of the 
proposal function but also better coverage of the prior 
parameter space with a population of chains. For the 
simulation experiments described in the present article, 
we took the most unfavourable scenario of complete 
absence of prior information about the chemical kinetic 
parameter values, for which we chose improper uniform 
prior distributions. For most practical applications, more 
informative priors are usually available, derived from 
expert elicitation, the biological literature, databases such 
as KEGG (Kyoto Encyclopedia of Genes and Genomes), 
and complementary experiments. We note that more 
informative priors can not only potentially lead to an 
improvement in the parameter estimation accuracy, but 
also to an improvement in the convergence of the Markov 
chains, due to the fact that they render the posterior distri- 
butions less diffuse. The estimates presented in the present 



study can therefore be regarded as conservative, providing 
performance indicators that, in practice, can potentially be 
improved on. Exploration of the parameter space could 
also be directed by introducing auxiliary information, in a 
systematic fashion, specific to the circadian model, such as 
period/amplitude or phase of the data. These considera- 
tions will allow for greater confidence in the predictions 
and fuller understanding of the model performance in dif- 
ferent parameter regimes. 
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